Generate external transform wrappers using a script #29834

ahmedabu98 · 2023-12-20T12:59:04Z

Implementing a script that generates wrappers for external SchemaTransforms, according to Option #3 in the following design doc: https://s.apache.org/autogen-wrappers

The script's workflow takes place in setup.py, which can be invoked for local setup or for building the SDK for a Beam release. Files are generated in a subdirectory that is ignored by git. From there, the wrappers can be imported into relevant __init__.py files.

Wrappers are generated along with any documentation provided by the underlying SchemaTransform, and they comply with existing linting and pydoc rules.

With the expansion service YAML config, the following transform YAML config is generated:

Transform YAML config

- default_service: sdks:java:io:expansion-service:shadowJar
  description: 'Outputs a PCollection of Beam Rows, each containing a single INT64
    number called "value". The count is produced from the given "start" value and either
    up to the given "end" or until 2^63 - 1.

    To produce an unbounded PCollection, simply do not specify an "end" value. Unbounded
    sequences can specify a "rate" for output elements.

    In all cases, the sequence of numbers is generated in parallel, so there is no
    inherent ordering between the generated values'
  destinations:
    python: apache_beam/io
  fields:
    end:
      description: The maximum number to generate (exclusive). Will be an unbounded
        sequence if left unspecified.
      nullable: true
      type: numpy.int64
    rate:
      description: Specifies the rate to generate a given number of elements per a
        given number of seconds. Applicable only to unbounded sequences.
      nullable: true
      type: Row(seconds=typing.Union[numpy.int64, NoneType], elements=<class 'numpy.int64'>)
    start:
      description: The minimum number to generate (inclusive).
      nullable: false
      type: numpy.int64
  identifier: beam:schematransform:org.apache.beam:generate_sequence:v1
  name: GenerateSequence

From this config, external transform wrappers are created and stored in appropriate modules. For example, our config gives us the following apache_beam/transforms/xlang/io.py file:

GenerateSequence wrapper

# NOTE: This file contains autogenerated external transform(s)
# and should not be edited by hand.
# Refer to gen_xlang_wrappers.py for more info.

"""Cross-language transforms in this module can be imported from the
:py:mod:`apache_beam.io` package."""

# pylint:disable=line-too-long

from apache_beam.transforms.external import BeamJarExpansionService
from apache_beam.transforms.external_transform_provider import ExternalTransform


class GenerateSequence(ExternalTransform):
  """
  Outputs a PCollection of Beam Rows, each containing a single INT64 number
  called "value". The count is produced from the given "start" value and either
  up to the given "end" or until 2^63 - 1.
  To produce an unbounded PCollection, simply do not specify an "end" value.
  Unbounded sequences can specify a "rate" for output elements.
  In all cases, the sequence of numbers is generated in parallel, so there is no
  inherent ordering between the generated values
  """
  identifier = "beam:schematransform:org.apache.beam:generate_sequence:v1"

  def __init__(self, start, end=None, rate=None, expansion_service=None):
    """
    :param start: (numpy.int64)
      The minimum number to generate (inclusive). 
    :param end: (numpy.int64)
      The maximum number to generate (exclusive). Will be an unbounded
      sequence if left unspecified. 
    :param rate: (Row(seconds=typing.Union[numpy.int64, NoneType], elements=<class 'numpy.int64'>))
      Specifies the rate to generate a given number of elements per a given
      number of seconds. Applicable only to unbounded sequences. 
    """
    self.default_expansion_service = BeamJarExpansionService(
        "sdks:java:io:expansion-service:shadowJar")
    super().__init__(
        start=start, end=end, rate=rate, expansion_service=expansion_service)

Including documentation for how the script is used, and unit tests for different parts of the script.

Adding a gradle command ./gradlew generateExternalTransformsConfig to build the configs (it takes care of building the relevant expansion jars beforehand). Once th

Also adding a PreCommit test that generates the transform config from scratch and compares it with the existing one. This serves to indicate whether a change will render the existing config out of sync. To resolve, one would re-generate the config (with ./gradlew generateExternalTransformsConfig) and commit the changes.
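
The sync check can be pictured roughly as follows. This is a sketch under assumptions, not the PreCommit's actual implementation: it compares two config texts with the standard library's difflib, and `config_diff` is a hypothetical helper name. The sample config strings are made up for illustration.

```python
import difflib


def config_diff(existing_text, regenerated_text):
  """Return unified-diff lines; an empty list means the configs are in sync."""
  return list(
      difflib.unified_diff(
          existing_text.splitlines(),
          regenerated_text.splitlines(),
          fromfile="existing",
          tofile="regenerated",
          lineterm=""))


# Illustrative inputs only (not real config content).
existing = "- name: GenerateSequence\n  identifier: a:v1\n"
regenerated = "- name: GenerateSequence\n  identifier: a:v2\n"

in_sync = config_diff(existing, existing)
drift = config_diff(existing, regenerated)

if drift:
  # A check like this would fail and ask the author to re-generate the config.
  print("\n".join(drift))
```

A non-empty diff signals that the committed config is out of date and should be re-generated with ./gradlew generateExternalTransformsConfig.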


github-actions · 2023-12-22T01:28:40Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @jrmccluskey for label python.
R: @damccorm for label build.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

sdks/python/container/py39/base_image_requirements.txt

damccorm · 2023-12-26T14:54:35Z

R: @robertwb @tvalentyn @chamikaramj

(manually requesting to make the review bot happy)

github-actions · 2023-12-26T14:55:41Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

.github/workflows/beam_PreCommit_Xlang_Generated_Transforms.yml

robertwb · 2024-02-12T22:39:21Z

buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy

+              // setup test env
+              def serviceArgs = project.project(':sdks:python').mapToArgString(expansionServiceOpts)
+              executable 'sh'
+              args '-c', ". ${project.ext.envdir}/bin/activate && $pythonDir/scripts/run_expansion_services.sh stop --group_id ${project.name} && $pythonDir/scripts/run_expansion_services.sh start $serviceArgs"


If we don't do this now, could you at least file a bug and drop a TODO?

sdks/python/setup.py

sdks/standard_expansion_services.yaml

robertwb · 2024-02-12T23:24:12Z

Ah, yes, that's right. Hopefully the file generation itself isn't *that* slow.

…

On Mon, Feb 12, 2024 at 3:02 PM Ahmed Abualsaud wrote:

> In sdks/python/setup.py:
>
> +    # if exists, this directory will have at least its __init__.py file
> +    if (not os.path.exists(generated_transforms_dir) or
> +        len(os.listdir(generated_transforms_dir)) <= 1):
> +      message = 'External transform wrappers have not been generated '
> +      if not script_exists:
> +        message += 'and the generation script `gen_xlang_wrappers.py`'
> +      if not config_exists:
> +        message += 'and the standard external transforms config'
> +      message += ' could not be found'
> +      warnings.warn(message)
> +    else:
> +      warnings.warn(
> +          'Skipping external transform wrapper generation as they '
> +          'are already generated.')
> +    return
> +  out = subprocess.run([
>
> > This will make pretty much every run of setup.py slow, right?
>
> Not anymore, since we decided to decouple the generation steps. setup.py expects the transform config to already exist, so no expansion + discovery is done here. All it does is generate files based on the existing config.

tvalentyn · 2024-02-13T23:39:37Z

I think we need to put the import back in the try:, except: pass block?

That will work. Alternatively, you can have dedicated functions, one to generate the config and one to generate the wrappers, and make the necessary imports only inside the functions that require them; you might have to add # pylint: disable=g-import-not-at-top.

ahmedabu98 · 2024-02-14T15:57:36Z

@tvalentyn this is already the case: the steps are in different functions. For example, we only import apache_beam in the function that generates transform configs.
But this function is unusable in a clean setup (one that doesn't have generated modules) unless we make imports like from apache_beam.transforms.xlang.io import * optional.
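
One way to make such an import optional is sketched below. This is an assumption about the general pattern, not the code that was merged; `import_optional` is a hypothetical helper, and in a package __init__.py the same effect is usually achieved with a try/except ImportError around the import itself.

```python
import importlib


def import_optional(module_name):
  """Import a module if it exists; return None when it cannot be imported.

  ModuleNotFoundError is a subclass of ImportError, so a missing generated
  module (e.g. in a clean checkout) is handled here as well.
  """
  try:
    return importlib.import_module(module_name)
  except ImportError:
    return None


# In a clean setup the generated module may be absent; the result is then None.
xlang_io = import_optional("apache_beam.transforms.xlang.io")
if xlang_io is None:
  print("generated wrappers not available; continuing without them")
```

With this pattern, code that only needs to generate the config can still import the package before any wrappers exist.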

sdks/python/apache_beam/io/__init__.py


ahmedabu98 · 2024-02-16T16:24:19Z

The newly added precommit that performs a transform config sync check (beam_PreCommit_Xlang_Generated_Transforms) is running green on my fork: https://github.com/ahmedabu98/beam/actions/runs/7933231703/

Will merge after current tests pass


ahmedabu98 added 3 commits December 13, 2023 23:40

checkpoint

f80f1ac

Merge branch 'master' of https://github.com/ahmedabu98/beam into gen_…

2517445

…wrappers_script

gen_xlang_wrappers workflow

e5bf704

github-actions bot added python build labels Dec 20, 2023

ahmedabu98 changed the title ~~Generate Python wrappers for external transforms in setup.py~~ Generate external transform wrappers using a script Dec 20, 2023

ahmedabu98 added 6 commits December 20, 2023 22:09

add tests; integrate with setup.py

9295861

remove duplicate changes; adjust transform positions in standard serv…

1ec6c11

…ice config

undo deleted line

f9bbf42

add more config modifications

6a2d3d2

warn when generation script not found

23450fd

add jinja2 dependency; include script in MANIFEST.in; create a xlang …

01dbc63

…IO direct GHA workflow

github-actions bot added the docker label Dec 20, 2023

ahmedabu98 added 7 commits December 21, 2023 01:51

add MarkupSafe==2.1.3 because jinja2 needs it

0bc4ff4

lints and fixes

390d8f8

jinja template abides more by lint/format rules in case yapf doesn't …

45a71ea

…exist

lint and fixes

8847562

no yapf; use random dir name for tests

18a157d

lint

c64d75b

lint

7049f80

ahmedabu98 marked this pull request as ready for review December 22, 2023 00:11

ahmedabu98 requested review from robertwb, tvalentyn and chamikaramj December 22, 2023 00:13

github-actions bot added the Next Action: Reviewers label Dec 22, 2023

format fix

ddd5b2f

ahmedabu98 commented Dec 22, 2023

sdks/python/container/py39/base_image_requirements.txt (outdated, resolved)

robertwb reviewed Feb 12, 2024

robertwb approved these changes Feb 12, 2024

tvalentyn reviewed Feb 14, 2024

sdks/python/apache_beam/io/__init__.py (resolved)

let python automatically start up expansion services instead of manua…

9568de5

…lly starting each one beforehand; skip existing transforms from gcp service config; cleanup precommit test config

github-actions bot added the gcp label Feb 15, 2024

ahmedabu98 and others added 8 commits February 15, 2024 15:30

remove merge conflict

1f49d75

Merge branch 'master' into gen_wrappers_script

bdf290f

touch postcommit files to trigger GHA

c96ade5

Merge branch 'gen_wrappers_script' of https://github.com/ahmedabu98/beam

082295d

into gen_wrappers_script

rename to external_provider_it_test.py to avoid running on unit test …

28146bd

…suites, where expansion service wouldn't be built

rename tests ..Test -> ..IT

5f9a227

small adjustments to pass python unit tests: import script from file …

06d5fd5

…path, use simple function to return beam jar exp service

run tests only when expansion jars are built

e42b3fe

github-actions bot added the extensions label Feb 16, 2024

ahmedabu98 added 2 commits February 16, 2024 10:45

load the script after importing

0d11059

skip test if jars not built

c1b29f3

Merge branch 'master' of https://github.com/ahmedabu98/beam into gen_…

620344c

…wrappers_script

github-actions bot removed the extensions label Feb 17, 2024

ahmedabu98 and others added 4 commits February 21, 2024 15:29

Merge branch 'master' into gen_wrappers_script

8fdb02a

touch postcommit files to trigger GHA

35da8b9

Merge branch 'gen_wrappers_script' of https://github.com/ahmedabu98/beam

bc43578

into gen_wrappers_script

correct command name (generateExternalTransformsConfig)

89e139e

ahmedabu98 merged commit 11f9bce into apache:master Feb 22, 2024
108 checks passed

This was referenced Feb 29, 2024

Fix hdfs integration test #30458

Merged

[Failing Test]: Python PostCommit failing hdfsIntegrationTest in generate_external_transform_wrappers #30459

Closed
